Category Levels in Hierarchical Text Categorization
Authors
Abstract
We consider the problem of assigning level numbers (weights) to hierarchically organized categories during text categorization. These levels control how strongly each category attracts documents during categorization. The levels are adjusted to balance recall and precision for each category. If a category's recall exceeds its precision, the category is too strong and its level is reduced. Conversely, if a category's precision exceeds its recall, its level is increased to strengthen it. The categorization algorithm is a supervised learning procedure that uses a linear classifier based on the category levels. We are given a set of categories, organized hierarchically, and a training corpus of documents, each already placed in one or more categories. From these, we extract a vocabulary for each category: words that appear with high frequency within that category and so characterize its subject area. Each node's vocabulary is filtered, and its words are assigned weights with respect to the specific category. Test documents are then scanned, and categories are ranked by the presence of vocabulary terms. Documents are assigned to categories based on these rankings. We demonstrate that precision and recall can be significantly improved by taking the hierarchy into account when solving the categorization problem. Specifically, we show that adjusting the category levels in a principled way improves precision from 84% to 91% on the much-studied Reuters-21578 corpus organized in a three-level hierarchy of categories.
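The balancing rule described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the multiplicative use of the level in the score, the fixed `step`, and the assignment `threshold` are all hypothetical, and the category set is treated as flat for brevity, so the hierarchy itself is not modeled.

```python
def score(doc_tokens, vocab_weights, level):
    """Linear score (assumed form): level times the summed weights of the
    category's vocabulary terms that are present in the document."""
    present = set(doc_tokens)
    return level * sum(w for term, w in vocab_weights.items() if term in present)


def classify(docs, vocab, levels, threshold=1.0):
    """Assign each document to every category whose score reaches the
    (hypothetical) threshold. Returns one set of category names per document."""
    return [
        {c for c in levels if score(doc, vocab[c], levels[c]) >= threshold}
        for doc in docs
    ]


def precision_recall(assigned, truth, category):
    """Precision and recall of one category over parallel lists of label sets.
    An empty denominator is treated as 1.0, so a category that attracts
    nothing is seen as too weak rather than too strong."""
    tp = sum(1 for a, t in zip(assigned, truth) if category in a and category in t)
    fp = sum(1 for a, t in zip(assigned, truth) if category in a and category not in t)
    fn = sum(1 for a, t in zip(assigned, truth) if category not in a and category in t)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


def adjust_levels(levels, assigned, truth, step=0.1):
    """One balancing pass: lower the level of a category whose recall exceeds
    its precision (too strong), raise it when precision exceeds recall."""
    new_levels = dict(levels)
    for category in levels:
        precision, recall = precision_recall(assigned, truth, category)
        if recall > precision:
            new_levels[category] -= step
        elif precision > recall:
            new_levels[category] += step
    return new_levels
```

In practice `classify` and `adjust_levels` would be alternated over the training corpus until precision and recall are roughly balanced for each category; how the step size and stopping rule are chosen is not specified in the abstract.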
Similar resources
Hierarchical text categorization using fuzzy relational thesaurus
Text categorization is the classification to assign a text document to an appropriate category in a predefined set of categories. We present a new approach for the text categorization by means of Fuzzy Relational Thesaurus (FRT). FRT is a multilevel category system that stores and maintains adaptive local dictionary for each category. The goal of our approach is twofold; to develop a reliable t...
Using Boolean Rule Extraction for Taxonomic Text Categorization for Big Data
Categorization hierarchies are ubiquitous in big data. Examples include MEDLINE’s Medical Subject Headings (MeSH) taxonomy, United Nations Standard Products and Services Code (UNSPSC) product codes, and the Medical Dictionary for Regulatory Activities (MedDRA) hierarchy for adverse reaction coding. A key issue is that in most taxonomies the probability of any particular example being in a categ...
Hierarchical Text Categorization Using Fuzzy Relational Thesaurus (Kybernetika)
Text categorization is the classification to assign a text document to an appropriate category in a predefined set of categories. We present a new approach for the text categorization by means of Fuzzy Relational Thesaurus (FRT). FRT is a multilevel category system that stores and maintains adaptive local dictionary for each category. The goal of our approach is twofold; to develop a reliable text cat...
Hierarchical vs. flat n-gram-based text categorization: Can we do better?
Hierarchical text categorization (HTC) refers to assigning a text document to one or more most suitable categories from a hierarchical category space. In this paper we present two HTC techniques based on kNN and SVM machine learning techniques for categorization process and byte n-gram based document representation. They are fully language independent and do not require any text preprocessing s...
Similarity relations in visual search predict rapid visual categorization.
How do we perform rapid visual categorization? It is widely thought that categorization involves evaluating the similarity of an object to other category items, but the underlying features and similarity relations remain unknown. Here, we hypothesized that categorization performance is based on perceived similarity relations between items within and outside the category. To this end, we measured...